- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
- North America > Canada (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
NL-Eye: Abductive NLI for Images
Ventura, Mor, Toker, Michael, Calderon, Nitay, Gekhman, Zorik, Bitton, Yonatan, Reichart, Roi
Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs' visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and to explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps: writing textual descriptions and generating images with text-to-image models, both of which required substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random-baseline levels, while humans excel at both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated-video verification.
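As a rough illustration of the evaluation protocol the abstract describes, the sketch below scores a VLM on premise-hypothesis triplets. The `Triplet` field names and the `query_vlm` callable are hypothetical stand-ins, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    premise: str          # path to the premise image
    hypothesis_a: str     # path to hypothesis image A
    hypothesis_b: str     # path to hypothesis image B
    gold: str             # "A" or "B": the more plausible hypothesis

PROMPT = (
    "Given the premise image, which hypothesis image (A or B) depicts the "
    "more plausible outcome or cause? Answer with the letter, then explain."
)

def evaluate(triplets, query_vlm):
    """query_vlm: any VLM interface taking images plus a prompt, returning text."""
    correct, explanations = 0, []
    for t in triplets:
        answer = query_vlm(images=[t.premise, t.hypothesis_a, t.hypothesis_b],
                           prompt=PROMPT)
        pred = "A" if answer.strip().upper().startswith("A") else "B"
        correct += int(pred == t.gold)
        explanations.append(answer)  # explanation quality is judged separately
    return correct / len(triplets), explanations
```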
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Workflow (1.00)
- Research Report > New Finding (0.67)
Disrupting Diffusion-based Inpainters with Semantic Digression
Son, Geonho, Lee, Juhun, Woo, Simon S.
The fabrication of visual misinformation on the web and social media has increased exponentially with the advent of foundational text-to-image diffusion models. Namely, Stable Diffusion inpainters allow the synthesis of maliciously inpainted images of personal and private figures and copyrighted content, also known as deepfakes. To combat such generations, a disruption framework, Photoguard, has been proposed, which adds adversarial noise to the context image to disrupt inpainting synthesis. While that framework takes a diffusion-friendly approach, its disruption is not sufficiently strong, and immunizing the context image requires a significant amount of GPU memory and time. In our work, we re-examine both the minimal and favorable conditions for a successful inpainting disruption and propose DDD, a "Digression guided Diffusion Disruption" framework. First, we identify the diffusion timestep range that is most adversarially vulnerable with respect to the hidden space. Within this scope of the noised manifold, we pose the problem as a semantic digression optimization: we maximize the distance between the inpainting instance's hidden states and a semantic-aware hidden-state centroid, calibrated both by Monte Carlo sampling of hidden states and by a discretely projected optimization in the token space. Effectively, our approach achieves stronger disruption and a higher success rate than Photoguard while lowering the GPU memory requirement and speeding up optimization by up to three times.
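The core optimization can be pictured as a PGD-style loop over the vulnerable timestep range. The sketch below is our reading of the abstract, not the authors' implementation: `encode`, `unet_hidden`, and the precomputed `centroid` (e.g., a Monte Carlo average of hidden states) are assumed stand-ins for the diffusion model's internals.

```python
import torch

def disrupt(context_img, encode, unet_hidden, centroid, t_range,
            eps=8/255, step=1/255, iters=100):
    """Immunize a context image by maximizing its hidden-state digression."""
    delta = torch.zeros_like(context_img, requires_grad=True)
    for _ in range(iters):
        t = torch.randint(t_range[0], t_range[1], (1,))  # vulnerable timesteps
        latents = encode(context_img + delta)
        h = unet_hidden(latents, t)                      # hidden states at step t
        loss = -torch.norm(h - centroid)                 # maximize the digression
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()            # signed gradient step
            delta.clamp_(-eps, eps)                      # keep noise imperceptible
            delta.grad.zero_()
    return (context_img + delta).clamp(0, 1)
```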
DeepCompass: AI-driven Location-Orientation Synchronization for Navigating Platforms
Lee, Jihun, Choi, SP, Kang, Bumsoo, Seok, Hyekyoung, Ahn, Hyoungseok, Jung, Sanghee
In current navigation platforms, the user's orientation is typically estimated from the difference between two consecutive locations. In other words, the orientation cannot be identified until the second location is taken. This asynchronous location-orientation identification often leads to a familiar real-life question: why does my navigator point my car in the wrong direction at the beginning of a trip? We propose DeepCompass to identify the user's orientation by bridging the gap between street-view and user-view images. First, we explore suitable model architectures and design the corresponding input configurations. Second, we apply artificial transformation techniques (e.g., style transfer and road segmentation) to minimize the disparity between the street view and the user's real-time view. We evaluate DeepCompass extensively under various driving conditions. DeepCompass requires no additional hardware and, in contrast to magnetometer-based navigators, is not susceptible to external interference. This highlights its potential as an add-on to existing sensor-based orientation detection methods.
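One simple way to realize the street-view/user-view matching idea is to sweep candidate headings and pick the one whose street-view crop best matches the user's current view. The sketch below assumes a generic image encoder `embed` and a panorama cropper `street_view_at`, both hypothetical; any preprocessing such as style transfer or road segmentation would be applied before embedding.

```python
import torch.nn.functional as F

def estimate_heading(user_view, street_view_at, embed, step_deg=30):
    """Return the candidate heading (degrees) whose street-view crop
    is most similar to the user's view under the shared encoder."""
    user_vec = F.normalize(embed(user_view), dim=-1)
    best_heading, best_score = None, float("-inf")
    for heading in range(0, 360, step_deg):
        crop_vec = F.normalize(embed(street_view_at(heading)), dim=-1)
        score = (user_vec * crop_vec).sum().item()  # cosine similarity
        if score > best_score:
            best_heading, best_score = heading, score
    return best_heading
```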
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.86)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
On the Adversarial Robustness of Multi-Modal Foundation Models
Schlarmann, Christian, Hein, Matthias
Multi-modal foundation models combining vision and language, such as Flamingo or GPT-4, have recently gained enormous interest. Alignment of foundation models is used to prevent them from providing toxic or harmful output. While malicious users have successfully tried to jailbreak foundation models, an equally important question is whether honest users could be harmed by malicious third-party content. In this paper we show that imperceptible attacks on images that change the caption output of a multi-modal foundation model can be used by malicious content providers to harm honest users, e.g., by guiding them to malicious websites or broadcasting fake information. This indicates that countermeasures against adversarial attacks should be employed by any deployed multi-modal foundation model.
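The kind of attack described here can be approximated by a standard targeted PGD loop that makes a chosen caption likely under an imperceptibly small image perturbation. In the sketch below, `model_loss` is an assumed differentiable interface (teacher-forced negative log-likelihood of the target caption), not any specific model's API.

```python
import torch

def caption_attack(image, target_tokens, model_loss,
                   eps=4/255, step=1/255, iters=200):
    """Craft an L-inf-bounded perturbation steering the caption toward a target."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = model_loss(image + delta, target_tokens)  # NLL of target caption
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # descend: make target caption likely
            delta.clamp_(-eps, eps)            # perturbation stays imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1)
```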
- Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.14)
- Europe > Netherlands > North Holland > Amsterdam (0.05)
- North America > United States > New York (0.04)
- Information Technology > Security & Privacy (0.67)
- Government > Military (0.49)
- Health & Medicine > Therapeutic Area (0.48)
- Leisure & Entertainment > Sports > Tennis (0.46)
MEWL: Few-shot multimodal word learning with referential uncertainty
Jiang, Guangyuan, Xu, Manjie, Xin, Shiji, Liang, Wei, Peng, Yujia, Zhang, Chi, Zhu, Yixin
Without explicit feedback, humans can rapidly learn the meaning of words. Children can acquire a new word after just a few passive exposures, a process known as fast mapping. This word-learning capability is believed to be the most fundamental building block of multimodal understanding and reasoning. Despite recent advancements in multimodal learning, a systematic and rigorous evaluation of human-like word learning in machines is still missing. To fill this gap, we introduce the MachinE Word Learning (MEWL) benchmark to assess how machines learn word meaning in grounded visual scenes. MEWL covers humans' core cognitive toolkits in word learning: cross-situational reasoning, bootstrapping, and pragmatic learning. Specifically, MEWL is a few-shot benchmark suite consisting of nine tasks for probing various word-learning capabilities. These tasks are carefully designed to align with children's core abilities in word learning and to echo theories in the developmental literature. By evaluating multimodal and unimodal agents' performance with a comparative analysis against human performance, we observe a sharp divergence between human and machine word learning. We further discuss these differences and call for human-like few-shot word learning in machines.
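To make the benchmark's few-shot structure concrete, the sketch below shows one plausible episode format and accuracy loop; the field names and the `agent` interface are ours, not MEWL's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    context_scenes: List[str]   # rendered scenes shown as passive exposures
    context_words: List[str]    # novel word uttered alongside each scene
    query_scene: str            # held-out scene to be named
    candidates: List[str]       # candidate words for the query
    answer: str                 # the correct word

def accuracy(episodes, agent):
    hits = 0
    for ep in episodes:
        # The agent must infer word meanings cross-situationally from the
        # context scene-word pairs, then name the query scene.
        pred = agent(ep.context_scenes, ep.context_words,
                     ep.query_scene, ep.candidates)
        hits += int(pred == ep.answer)
    return hits / len(episodes)
```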
- Asia > China > Beijing > Beijing (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Efficient Zero-shot Visual Search via Target and Context-aware Transformer
Ding, Zhiwei, Ren, Xuezhe, David, Erwan, Vo, Melissa, Kreiman, Gabriel, Zhang, Mengmi
Visual search is a ubiquitous challenge in natural vision, including daily tasks such as finding a friend in a crowd or searching for a car in a parking lot. Humans rely heavily on relevant target features to perform goal-directed visual search. Meanwhile, context is critically important for locating a target object in complex scenes, as it helps narrow down the search area and makes the search process more efficient. However, few works have combined both target and context information in visual search computational models. Here we propose a zero-shot deep learning architecture, TCT (Target and Context-aware Transformer), that modulates self-attention in a Vision Transformer with target- and context-relevant information to enable human-like zero-shot visual search performance. Target modulation is computed as patch-wise local relevance between the target and search images, whereas contextual modulation is applied in a global fashion. We conduct visual search experiments with TCT and other competitive visual search models on three natural-scene datasets with varying levels of difficulty. TCT demonstrates human-like performance in terms of search efficiency and beats SOTA models on challenging visual search tasks. Importantly, TCT generalizes well across datasets with novel objects without retraining or fine-tuning. Furthermore, we introduce a new dataset to benchmark models for invariant visual search under incongruent contexts. TCT manages to search flexibly via target and context modulation, even under incongruent contexts.
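A minimal sketch of the target-and-context modulation, as we read the abstract: patch-wise similarity to the target supplies a local cue, a global context prior biases it, and a softmax yields a priority map over search locations. This is illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def modulated_priority(search_patches, target_patches, context_prior):
    """search_patches: (N, D) patch embeddings of the search image.
    target_patches: (M, D) patch embeddings of the target exemplar.
    context_prior: (N,) global contextual weighting over search locations."""
    sim = (F.normalize(search_patches, dim=-1)
           @ F.normalize(target_patches, dim=-1).T)   # (N, M) patch similarities
    target_relevance = sim.max(dim=-1).values         # best local match per patch
    logits = target_relevance + context_prior         # combine local + global cues
    return torch.softmax(logits, dim=0)               # priority map over patches
```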
- North America > United States > Texas > Harris County > Houston (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Google's DeepMind AI can 'transframe' a single image into a video
Earlier this week, the team at Google's DeepMind AI lab unveiled a new model dubbed Transframer, which allows AI to generate 30-second videos from a single image input. It's a nifty little trick at first glance, but the implications are much larger than an interesting .GIF file. As the team put it in its announcement, Transframer is "a general-purpose generative framework that can handle many image and video tasks in a probabilistic setting," and new work shows it "excels in video prediction and view synthesis, and can generate 30s videos from a single image": https://t.co/wX3nrrYEEa "Transframer is state-of-the-art on a variety of video generation benchmarks, and… can generate coherent 30 second videos from a single image without any explicit geometric information," the DeepMind research team explains.
Graph-Based Continual Learning
Tang, Binh, Matteson, David S.
Despite significant advances, continual learning models still suffer from catastrophic forgetting when exposed to incrementally available data from non-stationary distributions. Rehearsal approaches alleviate the problem by maintaining and replaying a small episodic memory of previous samples, often implemented as an array of independent memory slots. In this work, we propose to augment such an array with a learnable random graph that captures pairwise similarities between its samples, and use it not only to learn new tasks but also to guard against forgetting. Empirical results on several benchmark datasets show that our model consistently outperforms recently proposed baselines for task-free continual learning.
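One way a learnable similarity graph over memory slots could both aid replay and guard against forgetting is sketched below; the regularizer and interfaces are our illustration of the idea, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphMemory(nn.Module):
    """Episodic memory slots plus a learnable pairwise-similarity graph (a sketch)."""
    def __init__(self, num_slots, input_dim):
        super().__init__()
        self.register_buffer("slots", torch.zeros(num_slots, input_dim))
        self.adj = nn.Parameter(torch.rand(num_slots, num_slots))  # edge logits

    def replay_regularizer(self, encoder):
        feats = F.normalize(encoder(self.slots), dim=-1)
        sim = feats @ feats.T                # current pairwise similarities
        graph = torch.sigmoid(self.adj)      # learned edge weights in [0, 1]
        # Penalizing drift between the learned graph and the current
        # similarities discourages representations of stored samples from
        # moving — one way a graph can guard against forgetting.
        return F.mse_loss(sim, graph)
```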
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- Europe > Hungary > Hajdú-Bihar County > Debrecen (0.04)
- Health & Medicine (0.51)
- Education (0.46)